Appendix: Exploring reported robberies on Mexico City’s transportation system with R

In this Appendix I will lay out in detail the process I followed for Exploring reported robberies on Mexico City’s transportation system. The main data set I use here is the Mexico City data on street-level crimes includes all of the crimes reported to this agency starting in January. This data is updated monthly on the Mexico City Open Data Portal.

I also use the shape-files for Mexico City’s neighborhoods, metro lines, metro stations, metrobus lines and metrobus stations.

R packages

To analyze this data I used the following R packages from the Tidyverse:readr, dplyr, forcats, ggplot2, lubridate stringr, and sf.

library(readr) # reading in data
library(dplyr) # data manipulation
library(forcats) # recoding variables
library(ggplot2) # visualization
library(lubridate) # date-time object manipulation
library(stringr) # strings and regular expressions
library(zoo) # moving average functions
library(sf) # working with spatial objects

Research questions

Before jumping into the data, I brainstormed a set of questions to guide this analysis:

  • What kinds of crimes are committed on Mexico City’s transportation system (both private cars and public transportation)?
  • How do the crimes reported vary by mode?
  • Have reported crimes increased or decreased over time? Does this vary by mode?
  • What are the months, days of the week and hours when crimes are more frequently committed on different modes of transportation?
  • For crimes committed on the Metro and Metrubus what are the most common stations committed?
  • For other modes, what neighborhoods are the most dangerous?

The data

I read in the data and do some initial checks to get familiar with this data set:

#read in data
crimes <- read_csv("../data/carpetas-de-investigacion-pgj-de-la-ciudad-de-mexico_2020-02-22 00_02.csv")

To start getting a feel for the data I look at the first few columns of data:

#Visualize first few columns
head(crimes)
## # A tibble: 6 x 18
##   ao_hechos mes_hechos fecha_hechos        delito categoria_delito fiscalia
##       <dbl> <chr>      <dttm>              <chr>  <chr>            <chr>   
## 1      2018 Marzo      2018-03-31 18:30:00 VIOLE… DELITO DE BAJO … INVESTI…
## 2      2018 Marzo      2018-03-31 12:40:00 PERDI… HECHO NO DELICT… INVESTI…
## 3      2018 Abril      2018-04-01 00:00:00 NARCO… DELITO DE BAJO … INVESTI…
## 4      2018 Marzo      2018-03-31 23:50:00 CONTR… DELITO DE BAJO … INVESTI…
## 5      2018 Abril      2018-04-01 01:10:00 LESIO… DELITO DE BAJO … INVESTI…
## 6      2018 Febrero    2018-02-07 15:15:00 FALSI… DELITO DE BAJO … INVESTI…
## # … with 12 more variables: agencia <chr>, unidad_investigacion <chr>,
## #   alcaldia_hechos <chr>, colonia_hechos <chr>, ao_inicio <dbl>,
## #   mes_inicio <chr>, fecha_inicio <dttm>, calle_hechos <chr>,
## #   calle_hechos2 <chr>, longitud <dbl>, latitud <dbl>, geopoint <chr>

I check for NAs in each column:

#count NAs i all columns
crimes %>% sapply( function(x) sum(is.na(x)))
##            ao_hechos           mes_hechos         fecha_hechos 
##                  389                  389                  389 
##               delito     categoria_delito             fiscalia 
##                    0                    0                    0 
##              agencia unidad_investigacion      alcaldia_hechos 
##                    0                  225                 3086 
##       colonia_hechos            ao_inicio           mes_inicio 
##                37057                    0                    0 
##         fecha_inicio         calle_hechos        calle_hechos2 
##                    0                 2505               558332 
##             longitud              latitud             geopoint 
##                36285                36285                36285

The data contains the following columns:

# Get column names
colnames(crimes)
##  [1] "ao_hechos"            "mes_hechos"           "fecha_hechos"        
##  [4] "delito"               "categoria_delito"     "fiscalia"            
##  [7] "agencia"              "unidad_investigacion" "alcaldia_hechos"     
## [10] "colonia_hechos"       "ao_inicio"            "mes_inicio"          
## [13] "fecha_inicio"         "calle_hechos"         "calle_hechos2"       
## [16] "longitud"             "latitud"              "geopoint"
  • Ao_hechos: Year of crime. (typo in column name)
  • mes_hechos: Month of crime.
  • fecha_hechos: Date of crime.
  • delito: Crime.
  • categoria_delito: Crime category.
  • fiscalia: Prosecution office.
  • agencia: Agency.
  • unidad_investigacion: Investigation Unit
  • alcaldia_hechos: Municipality whee crime was committed.
  • colonia_hechos: Neighborhood where crime was committed.
  • ao_inicio : Year when investigation started
  • mes_inicio: Month when investigation started
  • fecha_inicio: Date when investigation started .
  • geopoint: geo-location of crime.
  • colonia_hechos: Neighborhood where crime was committed.
  • calle_hechos: Cross-street 1
  • calle_hechos2: Cross-street 2
  • latitud: Latitude
  • longitud: Longitude

This first glance at the data reveals a few issues:

  • There is a lot of information that is repetitive (for example date time information, location information).
  • There is a lot of data that is unlikely to be used in this analysis.
  • I also still some work needs to be done to identify crimes that are committed on different modes of transportation. What types of crimes are included?
  • The data needs to be translated into english. I will do this further down the line so that I only translate the variables I need.

Data cleaning

The objective of this section is to work towards a data set that has the variables I need in the right format to answer my questions.

First, I work to identify the crimes committed on any mode of transportation. For this, I first look at the unique values of the Delito(crime) variable for clues using unique(victimas$Delito). This reveals 277 unique categories. Manually sifting through these data I spotted some categories that include the names of different modes. To make sure I capture them all, I look for all the categories that mention modes of transportation.

#list of modes of transportation
modes <- c("METRO", "METROBUS", "TAXI", "PESERO", "TRANSPORTE PÚBLICO", "AUTOBÚS FORÁNEO", "ECOBUS", "RTP", "TROLEBUS", "TREN LIGERO", "TROLEBUS", "VEHICULO", "TREN SUBURBANO", "TAXI")

#regular expression to help identify crimes that include any of te elements in mode
modes_regex <- paste(modes, collapse = "|")

# check if any of the words in modes are inside the delito category names
unique(crimes$delito) %>% str_subset(modes_regex)
##  [1] "ROBO DE OBJETOS DEL INTERIOR DE UN VEHICULO"                
##  [2] "ROBO A PASAJERO / CONDUCTOR DE VEHICULO CON VIOLENCIA"      
##  [3] "ROBO DE VEHICULO DE SERVICIO PARTICULAR CON VIOLENCIA"      
##  [4] "ROBO DE VEHICULO DE PEDALES"                                
##  [5] "ROBO A PASAJERO A BORDO DE PESERO COLECTIVO CON VIOLENCIA"  
##  [6] "ROBO A NEGOCIO Y VEHICULO CON VIOLENCIA"                    
##  [7] "ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VIOLENCIA"      
##  [8] "ROBO A PASAJERO A BORDO DE METRO SIN VIOLENCIA"             
##  [9] "ROBO A PASAJERO A BORDO DE METRO CON VIOLENCIA"             
## [10] "ROBO A PASAJERO A BORDO DE METROBUS SIN VIOLENCIA"          
## [11] "ROBO DE VEHICULO DE SERVICIO PÚBLICO SIN VIOLENCIA"         
## [12] "TENTATIVA DE ROBO DE VEHICULO"                              
## [13] "ROBO DE VEHICULO DE SERVICIO PÚBLICO CON VIOLENCIA"         
## [14] "ROBO A PASAJERO A BORDO DE TRANSPORTE PÚBLICO CON VIOLENCIA"
## [15] "ROBO A PASAJERO A BORDO DE TRANSPORTE PÚBLICO SIN VIOLENCIA"
## [16] "ROBO A CASA HABITACION Y VEHICULO SIN VIOLENCIA"            
## [17] "ROBO A REPARTIDOR Y VEHICULO CON VIOLENCIA"                 
## [18] "ROBO A PASAJERO EN TROLEBUS SIN VIOLENCIA"                  
## [19] "ROBO A PASAJERO A BORDO DE TAXI CON VIOLENCIA"              
## [20] "ROBO A PASAJERO A BORDO DE TAXI SIN VIOLENCIA"              
## [21] "ROBO A NEGOCIO Y VEHICULO SIN VIOLENCIA"                    
## [22] "ROBO A PASAJERO / CONDUCTOR DE TAXI CON VIOLENCIA"          
## [23] "ROBO A PASAJERO EN RTP CON VIOLENCIA"                       
## [24] "ROBO DE VEHICULO DE SERVICIO OFICIAL SIN VIOLENCIA"         
## [25] "ROBO A PASAJERO EN ECOBUS CON VIOLENCIA"                    
## [26] "ROBO A PASAJERO A BORDO DE METROBUS CON VIOLENCIA"          
## [27] "ROBO A PASAJERO A BORDO DE PESERO COLECTIVO SIN VIOLENCIA"  
## [28] "PERDIDA DE LA VIDA POR SUICIDIO EN EL METRO"                
## [29] "ROBO A TRANSPORTISTA Y VEHICULO PESADO CON VIOLENCIA"       
## [30] "POSESION DE VEHICULO ROBADO"                                
## [31] "ROBO A CASA HABITACION Y VEHICULO CON VIOLENCIA"            
## [32] "PRIV. ILEGAL DE LA LIB. Y ROBO DE VEHICULO"                 
## [33] "ROBO DE VEHICULO DE SERVICIO DE TRANSPORTE CON VIOLENCIA"   
## [34] "ROBO DE VEHICULO DE SERVICIO OFICIAL CON VIOLENCIA"         
## [35] "ROBO DE VEHICULO DE SERVICIO DE TRANSPORTE SIN VIOLENCIA"   
## [36] "ROBO A TRANSPORTISTA Y VEHICULO PESADO SIN VIOLENCIA"       
## [37] "ROBO A PASAJERO EN TREN SUBURBANO CON VIOLENCIA"            
## [38] "ROBO A PASAJERO EN RTP SIN VIOLENCIA"                       
## [39] "ROBO A PASAJERO EN TREN LIGERO SIN VIOLENCIA"               
## [40] "ROBO A PASAJERO EN AUTOBÚS FORÁNEO CON VIOLENCIA"           
## [41] "ROBO A REPARTIDOR Y VEHICULO SIN VIOLENCIA"                 
## [42] "ROBO A PASAJERO EN TREN SUBURBANO SIN VIOLENCIA"            
## [43] "ROBO A PASAJERO EN ECOBUS SIN VIOLENCIA"                    
## [44] "ROBO A PASAJERO EN TROLEBUS CON VIOLENCIA"                  
## [45] "HOMICIDIO INTENCIONAL Y ROBO DE VEHICULO"                   
## [46] "ROBO A PASAJERO A BORDO DE PESERO Y VEHICULO CON VIOLENCIA" 
## [47] "LESIONES INTENCIONALES Y ROBO DE VEHICULO"                  
## [48] "ROBO A TRANSEUNTE Y VEHICULO CON VIOLENCIA"                 
## [49] "ROBO A PASAJERO EN TREN LIGERO CON VIOLENCIA"               
## [50] "VIOLACION EQUIPARADA Y ROBO DE VEHICULO"                    
## [51] "ROBO DE VEHICULO Y NOMINA CON VIOLENCIA"                    
## [52] "VIOLACION Y ROBO DE VEHICULO"

This is a good approximation, but I still end up with many categories that don’t reflect the type of crimes that I am looking for, which identifies crimes on people using different modes of transportation. However, a close look at this data reveals that the categories I am interested include the term passenger (PASAJERO)

unique(crimes$delito) %>% str_subset("PASAJERO")
##  [1] "ROBO A PASAJERO / CONDUCTOR DE VEHICULO CON VIOLENCIA"      
##  [2] "ROBO A PASAJERO A BORDO DE PESERO COLECTIVO CON VIOLENCIA"  
##  [3] "ROBO A PASAJERO A BORDO DE METRO SIN VIOLENCIA"             
##  [4] "ROBO A PASAJERO A BORDO DE METRO CON VIOLENCIA"             
##  [5] "ROBO A PASAJERO A BORDO DE METROBUS SIN VIOLENCIA"          
##  [6] "ROBO A PASAJERO A BORDO DE TRANSPORTE PÚBLICO CON VIOLENCIA"
##  [7] "ROBO A PASAJERO A BORDO DE TRANSPORTE PÚBLICO SIN VIOLENCIA"
##  [8] "ROBO A PASAJERO EN TROLEBUS SIN VIOLENCIA"                  
##  [9] "ROBO A PASAJERO A BORDO DE TAXI CON VIOLENCIA"              
## [10] "ROBO A PASAJERO A BORDO DE TAXI SIN VIOLENCIA"              
## [11] "ROBO A PASAJERO / CONDUCTOR DE TAXI CON VIOLENCIA"          
## [12] "ROBO A PASAJERO EN RTP CON VIOLENCIA"                       
## [13] "ROBO A PASAJERO EN ECOBUS CON VIOLENCIA"                    
## [14] "ROBO A PASAJERO A BORDO DE METROBUS CON VIOLENCIA"          
## [15] "ROBO A PASAJERO A BORDO DE PESERO COLECTIVO SIN VIOLENCIA"  
## [16] "ROBO A PASAJERO EN TREN SUBURBANO CON VIOLENCIA"            
## [17] "ROBO A TRANSEUNTE EN TERMINAL DE PASAJEROS CON VIOLENCIA"   
## [18] "ROBO A PASAJERO EN RTP SIN VIOLENCIA"                       
## [19] "ROBO A PASAJERO EN TREN LIGERO SIN VIOLENCIA"               
## [20] "ROBO A PASAJERO EN AUTOBÚS FORÁNEO CON VIOLENCIA"           
## [21] "ROBO A PASAJERO EN TREN SUBURBANO SIN VIOLENCIA"            
## [22] "ROBO A PASAJERO EN ECOBUS SIN VIOLENCIA"                    
## [23] "ROBO A PASAJERO EN TROLEBUS CON VIOLENCIA"                  
## [24] "ROBO A PASAJERO A BORDO DE PESERO Y VEHICULO CON VIOLENCIA" 
## [25] "ROBO A PASAJERO EN AUTOBUS FORANEO SIN VIOLENCIA"           
## [26] "ROBO A PASAJERO EN TREN LIGERO CON VIOLENCIA"

With this information, I am ready to subset the data. I notice that the crimes reported by passengers of different modes of transportation are all related to robberies. There is also information on whether the robbery was violent and non-violent.

# filter out crimes that include the word "PASAJERO" but not "TRANSEUNTE"
transport_crimes <- crimes %>% filter(str_detect( delito, "PASAJERO") & !str_detect( delito, "TRANSEUNTE"))

head(transport_crimes)
## # A tibble: 6 x 18
##   ao_hechos mes_hechos fecha_hechos        delito categoria_delito fiscalia
##       <dbl> <chr>      <dttm>              <chr>  <chr>            <chr>   
## 1      2018 Abril      2018-04-01 05:10:00 ROBO … DELITO DE BAJO … INVESTI…
## 2      2018 Marzo      2018-03-31 10:40:00 ROBO … ROBO A PASAJERO… INVESTI…
## 3      2014 Abril      2014-04-02 07:00:00 ROBO … ROBO A PASAJERO… INVESTI…
## 4      2018 Marzo      2018-03-26 19:10:00 ROBO … ROBO A PASAJERO… INVESTI…
## 5      2018 Abril      2018-04-02 01:30:00 ROBO … DELITO DE BAJO … INVESTI…
## 6      2018 Marzo      2018-03-24 10:30:00 ROBO … ROBO A PASAJERO… INVESTI…
## # … with 12 more variables: agencia <chr>, unidad_investigacion <chr>,
## #   alcaldia_hechos <chr>, colonia_hechos <chr>, ao_inicio <dbl>,
## #   mes_inicio <chr>, fecha_inicio <dttm>, calle_hechos <chr>,
## #   calle_hechos2 <chr>, longitud <dbl>, latitud <dbl>, geopoint <chr>

The data set includes many variables I don’t need so I select only the variables I am interested in. Based on my research questions I select the following variables and rename them in english to ease the rest of the analysis: crime = delito, crime_date = fecha_hechos, lat = latitud, lon = longitude, municipality = alcaldia_hechos, neihborhood= ColoniaHechos:

#select desired variables and translate variable names
transport_crimes <- transport_crimes %>% select(crime = delito, 
                                                      datetime = fecha_hechos, 
                                                      lat = latitud, 
                                                      lon = longitud,
                                                      municipality = alcaldia_hechos, 
                                                      neighborhood = colonia_hechos) 

The crime variable contains two important pieces of information that still needs to be separated. The category includes - the mode of transportation where victims were robbed - whether the robbery was violent. To sparate this informetion I create two additional categorical variables. transp_mode to indicate the mode, and violence, which indicates whether or not the robbery was violent. With this information parsed out, I can also remove the crime variable from my data set.

transport_crimes <- transport_crimes %>% 
                      #create categorical transp_mode variable. 
                        mutate( transp_mode = factor(case_when(
                            is.na(crime) ~ NA_character_,
                            str_detect(crime, "METROBUS") ~ "METROBUS",
                            str_detect(crime, "METRO") ~ "METRO",
                            str_detect(crime, "PESERO") ~ "PESERO",
                            str_detect(crime, "TRANSPORTE PÚBLICO") ~ "TRANSPORTE PUBLICO",
                            str_detect(crime, "AUTOBÚS FORÁNEO") ~ "AUTOBUS FORANEO",
                            str_detect(crime, "AUTOBUS FORANEO") ~ "AUTOBUS FORANEO",
                            str_detect(crime, "ECOBUS") ~ "ECOBUS",
                            str_detect(crime, "RTP") ~ "RTP",
                            str_detect(crime, "TREN LIGERO") ~ "TREN LIGERO",
                            str_detect(crime, "TROLEBUS") ~ "TROLEBUS",
                            str_detect(crime, "VEHICULO") ~ "VEHICULO",
                            str_detect(crime, "TREN SUBURBANO") ~ "TREN SUBURBANO",
                            str_detect(crime, "TAXI") ~ "TAXI"
                          ))) %>% 
                    # create categorical variable for violence
                      mutate( violence = factor(case_when(
                        is.na(crime) ~ NA_character_,
                        str_detect(crime, "CON VIOLENCIA") ~ "Yes",
                        str_detect(crime, "SIN VIOLENCIA") ~ "No"
                      ))) %>% 

                    # remove crime column from the data
                    select(-crime)

I check how many data points are in each transp_mode category to see if I should create a an OTHER category. I also note that there is a category called TRANSPORTE PUBLICO where I don’t know the specific mode.

 transport_crimes %>% group_by(transp_mode) %>% count() 
## # A tibble: 12 x 2
## # Groups:   transp_mode [12]
##    transp_mode            n
##    <fct>              <int>
##  1 AUTOBUS FORANEO       60
##  2 ECOBUS                45
##  3 METRO              10037
##  4 METROBUS            2867
##  5 PESERO              3467
##  6 RTP                  124
##  7 TAXI                2819
##  8 TRANSPORTE PUBLICO  5752
##  9 TREN LIGERO           89
## 10 TREN SUBURBANO        22
## 11 TROLEBUS             177
## 12 VEHICULO           10542

I collapse the transp_mode categories that have less then 100 data points and the unspecified public transport category into a new OTHER Category. I will also use this step to translate the names of these categories.

transport_crimes <- transport_crimes %>% mutate(transp_mode = fct_recode(transp_mode, Metro = "METRO", 
                                                                    Car = "VEHICULO", 
                                                                    Pesero =  "PESERO",
                                                                    Other = "TRANSPORTE PUBLICO", 
                                                                    Taxi =  "TAXI", 
                                                                    Other = "RTP", 
                                                                    Metrobus  = "METROBUS", 
                                                                    Other = "TREN LIGERO", 
                                                                    Other = "TROLEBUS", 
                                                                    Other = "AUTOBUS FORANEO",   
                                                                    Other = "TREN SUBURBANO",     
                                                                    Other = "ECOBUS"))

# smell test for recode
 transport_crimes %>% group_by(transp_mode) %>% count() 
## # A tibble: 6 x 2
## # Groups:   transp_mode [6]
##   transp_mode     n
##   <fct>       <int>
## 1 Other        6269
## 2 Metro       10037
## 3 Metrobus     2867
## 4 Pesero       3467
## 5 Taxi         2819
## 6 Car         10542

I check for NAs in each column.

sapply(transport_crimes, function(x) sum(is.na(x)))
##     datetime          lat          lon municipality neighborhood  transp_mode 
##            8          483          483          114          506            0 
##     violence 
##            0

I check on how the data is distributed over time

# group by years and count to check distribution of data over time
transport_crimes %>% select(transp_mode, datetime) %>% 
                      mutate(year = year(datetime)) %>% 
                      group_by(year) %>% count()
## # A tibble: 22 x 2
## # Groups:   year [22]
##     year     n
##    <dbl> <int>
##  1  1915     1
##  2  1955     1
##  3  1958     1
##  4  1969     1
##  5  1980     1
##  6  1986     2
##  7  1990     1
##  8  1993     1
##  9  2001     3
## 10  2007     1
## # … with 12 more rows

For practical purposes, I will focus on the the data for 2016-2019. These years contain a larger amount of reports and are the most recent years with a full year’s worth of data.

# Take transport crimes 
transport_crimes <- transport_crimes %>% 
                      mutate(week = week(datetime), year = year(datetime)) %>% 
                      filter (year > 2015)

Finally, I look at the data set to see what I ended up with.

head(transport_crimes)
## # A tibble: 6 x 9
##   datetime              lat   lon municipality neighborhood transp_mode violence
##   <dttm>              <dbl> <dbl> <chr>        <chr>        <fct>       <fct>   
## 1 2018-04-01 05:10:00  19.4 -99.2 ALVARO OBRE… LOMAS DE BE… Car         Yes     
## 2 2018-03-31 10:40:00  19.3 -99.1 IZTAPALAPA   LOS REYES C… Pesero      Yes     
## 3 2018-03-26 19:10:00  19.4 -99.1 VENUSTIANO … ROMERO RUBIO Metro       No      
## 4 2018-04-02 01:30:00  19.4 -99.1 IZTACALCO    MILITAR MAR… Car         Yes     
## 5 2018-03-24 10:30:00  19.4 -99.1 CUAUHTEMOC   CENTRO       Metro       Yes     
## 6 2018-03-27 19:30:00  19.4 -99.2 BENITO JUAR… NAPOLES      Metrobus    No      
## # … with 2 more variables: week <dbl>, year <dbl>

I am now in better shape to answer my research questions. In the following sections, As I explore the data, I continue to manipulate it as needed.

Analysis and visualization

What kinds of crimes are committed on Mexico City’s transportation system (both private cars and public transportation)?How do the crimes reported vary by mode?

The crimes included in the data set for modes of transportation are violent and non-violent robberies. I explore how these are distributed across modes for the entire time period (2016-2019).

# I group by mode and count so that my categorical variable is ordered by the frequency of crimes
mode_levels <- transport_crimes %>% group_by(transp_mode) %>% count() %>% arrange(n) %>% pull(transp_mode)

# I make a simple frequency plot to vizualize how many crimes are reported for each mode. I separate these by violent and non-violent crime.
transport_crimes %>% select(transp_mode, violence, datetime) %>% 
                      mutate(transp_mode = factor(transp_mode, levels=rev(mode_levels))) %>% 
                      filter(!is.na(transp_mode)) %>% 
                      ggplot(aes(x=transp_mode, fill = violence)) + 
                      geom_bar() + 
                      #geom_text(stat = "count", aes(label = ..count.., y = ..count..),position=position_stack(0.5)) +
                      theme(axis.text.x = element_text(angle = 30)) +
                      theme_minimal() + 
                      labs(
                      title = "Number of reported robberies by mode of transport",
                      subtitle = "Mexico City 2016-2019",
                      caption = "Source: Carpetas de Investigacion PGJ, Ciudad de México") +
                      xlab("Mode") +
                      ylab("Number of reports") +
                      scale_fill_discrete(name = "Violent?") +
                      theme(legend.position= c(0.2,-0.12), legend.direction="horizontal")

  • Have reported crimes increased or decreased over time? Does this vary by mode?*
transport_crimes %>%  
  select(transp_mode, datetime, violence) %>% 
  mutate(day = date(datetime)) %>% 
  group_by(day, violence, transp_mode) %>% count() %>% ungroup() %>%
  group_by(transp_mode, violence) %>%
  arrange(day) %>% # Make sure days are in order for rolling mean
  mutate(avg_n_7 = rollmean(n, 14, fill = c(NA, "extend", NA)))  %>% # 14 day rolling mean
                    ggplot(aes(x = day, y = avg_n_7, color = violence, group = violence)) + geom_line() + 
                    facet_grid(transp_mode ~ .) +
                    scale_x_date(date_breaks = "3 months", date_labels = "%b-%Y") +
                    theme_minimal() +
                    ylab("Number of reports") + xlab("Time") +
                    theme_minimal() +
                    theme(axis.text.x = element_text(angle = 90)) +
                      labs(
                      title = "Number of reported robberies over time by mode of transport",
                      subtitle = "Mexico City 2016-2019",
                      caption = "Source: Carpetas de Investigacion PGJ, Ciudad de México") +
                      theme(legend.position= c(0.2,-0.3), legend.direction="horizontal") +
                      scale_color_discrete(name = "Violent?") +
                      theme(strip.text = element_text(size = 7))

 transport_crimes %>% 
                    mutate(month = as.factor(month(datetime)), year = as.factor(year(datetime))) %>% 
                    group_by(year, month, transp_mode) %>%
                    count(month) %>%
                    ggplot(aes(x= month, y = n, group = year, color = year))  + geom_line() + facet_wrap(~transp_mode) +
                    theme_minimal() +
                    labs(title = "Number of reported robberies by month and year",
                      subtitle = "Mexico City 2016-2019",
                      caption = "Source: Carpetas de Investigacion PGJ, Ciudad de México") +
                      xlab("Month") +
                      ylab("Number of reports") + 
                    scale_x_discrete(labels=c("1" = "January", "2" = "February", "3" = "March",
                                              "4" = "April", "5" = "May", "6" = "June", "7" = "July",
                                              "8" = "August",  "9" = "September" , "10" = "October", "11" = "November", "12" = "December"))  + 
                    theme(axis.text.x = element_text(angle = 90)) + theme(legend.position= c(0.15,-0.3), legend.direction="horizontal")

What are the months, days of the week and hours when crimes are more frequently committed on different modes of transportation?

transport_crimes %>% mutate(month = as.factor(month(datetime)), year = as.factor(year(datetime))) %>% 
                    group_by(violence, month, transp_mode, year) %>%
                    ggplot(aes(x= month, group = violence, fill = violence))  + geom_bar() + facet_grid(year~transp_mode) +
                    theme_minimal() + labs(title = "Number of reported robberies by month and year",
                      subtitle = "Mexico City 2016-2019",
                      caption = "Source: Carpetas de Investigacion PGJ, Ciudad de México") +
                      ylab("Number of reports") + 
                    scale_x_discrete(labels=c("1" = "January", "2" = "February", "3" = "March",
                                              "4" = "April", "5" = "May", "6" = "June", "7" = "July",
                                              "8" = "August",  "9" = "September" , "10" = "October", "11" = "November", "12" = "December"))  + 
                    theme(axis.text.x = element_text(angle = 90)) + theme(legend.position= c(0.15,-0.25), legend.direction="horizontal") +
                    scale_fill_discrete(name = "Violent?") + theme(axis.text=element_text(size=7)) +
                    theme(axis.title.x = element_blank())

transport_crimes %>% mutate(month = as.factor(month(datetime)), year = as.factor(year(datetime)), weekday = as.factor(wday(datetime))) %>%
                    group_by(violence, weekday, transp_mode) %>%
                    ggplot(aes(x= weekday, group = violence, fill = violence))  + geom_bar() + facet_grid(year~transp_mode) +
                    theme_minimal() +
                    scale_x_discrete(labels=c("1" = "Sunday", "2" = "Monday", "3" = "Tuesday",
                                              "4" = "Wednesdy", "5" = "Thursday", "6" = "Friday", "7" = "Saturday")) + 
                    theme(axis.text.x = element_text(angle = 90)) + 
                    theme(legend.position= c(0.15,-0.24), legend.direction="horizontal") +
                    scale_fill_discrete(name = "Violent?") +
                     theme_minimal() + 
                      labs(title = "Number of reported robberies by weekday and year",
                      subtitle = "Mexico City 2016-2019",
                      caption = "Source: Carpetas de Investigacion PGJ, Ciudad de México") +
                      ylab("Number of reports") + 
                    theme(axis.text.x = element_text(angle = 90)) + theme(legend.position= c(0.15,-0.3), legend.direction="horizontal") +
                    scale_fill_discrete(name = "Violent?") +
                    theme(axis.title.x = element_blank())

 transport_crimes %>%  select(transp_mode, datetime, violence) %>% 
                      mutate(hour = hour(datetime), year = year(datetime)) %>% 
                     ggplot(aes(x = hour, fill = violence)) + geom_bar() +
                    facet_grid(year~transp_mode) +
                    theme_minimal() +
                    theme(strip.text.x = element_text(size = rel(0.6))) +
                    ylab("Number of crimes") + 
                    theme(axis.title.x = element_blank()) +
                    theme_minimal() + 
                      labs(title = "Number of reported robberies by hour of day and year",
                      subtitle = "Mexico City 2016-2019",
                      caption = "Source: Carpetas de Investigacion PGJ, Ciudad de México") +
                      ylab("Number of reports") + 
                      scale_x_continuous(breaks = c(0,5,10,15,20), labels= c(`0` = "00:00", `5` = "5:00", 
                                                                             `10` = "10:00", `15`= "15:00", `20` = "20:00")) + 
                    theme(axis.text.x = element_text(angle = 90)) +
                    theme(axis.title.x = element_blank()) + theme(axis.text=element_text(size=7)) + 
                    theme(legend.position= c(0.15,-0.15), legend.direction="horizontal") +
                    scale_fill_discrete(name = "Violent?")

For crimes committed on the Metro what are the most common stations where they are committed?

To explore questions related to locations, I will use the additional data sets that include shapefiles for Mexico City Neighborhoods, Transit (Metro and Metrobus) Lines, and Transit Stations.

# Load neighborhood ("colonias") shapefiles
colonias <- st_read("../data/coloniascdmx.shp") %>% select(name = nombre, geometry) %>% st_transform(4485)
## Reading layer `coloniascdmx' from data source `/Users/rebecadebuen/Dropbox/Projects/carpetas_investigacion/data/coloniascdmx.shp' using driver `ESRI Shapefile'
## replacing null geometries with empty geometries
## Simple feature collection with 1812 features and 7 fields (with 4 geometries empty)
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -99.35038 ymin: 19.12798 xmax: -98.94565 ymax: 19.59378
## CRS:            4326
# glance at the data
head(colonias)
## Simple feature collection with 6 features and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 1742411 ymin: 2177527 xmax: 1752575 ymax: 2192076
## CRS:            EPSG:4485
##                                      name                       geometry
## 1                    LOMAS DE CHAPULTEPEC POLYGON ((1743493 2191136, ...
## 2 LOMAS DE REFORMA (LOMAS DE CHAPULTEPEC) POLYGON ((1742588 2189493, ...
## 3                    DEL BOSQUE (POLANCO) POLYGON ((1744732 2191762, ...
## 4              PEDREGAL DE SANTA URSULA I POLYGON ((1752272 2179507, ...
## 5                                AJUSCO I POLYGON ((1750836 2180604, ...
## 6               VISTAS DEL MAUREL (U HAB) POLYGON ((1749303 2177713, ...
# make lat and lon into gemetrical object sf can work with
transport_crimes_geolocated <- transport_crimes %>% filter(!is.na(lat)) %>% st_as_sf(coords = c("lon", "lat"), crs=4326) %>% st_transform(4485)
metro_lines <- st_read("../data/lineas-de-metro/lineas-de-metro.shp") %>% st_transform(4485) %>% 
                    mutate(line = fct_recode(name, "Line 1" = "Línea 1", 
                                               "Line 2" = "Línea 2",
                                               "Line 3" = "Línea 3",
                                               "Line 4" = "Línea 4",
                                               "Line 5" = "Línea 5",
                                               "Line 6" = "Línea 6",
                                               "Line 7" = "Línea 7",
                                               "Line 8" = "Línea 8",
                                               "Line 9" = "Línea 9",
                                               "Line A" = "Línea A",
                                               "Line B" = "Línea B",
                                               "Line 12" = "Línea 12")) %>% 
                                              select(-name)
## Reading layer `lineas-de-metro' from data source `/Users/rebecadebuen/Dropbox/Projects/carpetas_investigacion/data/lineas-de-metro/lineas-de-metro.shp' using driver `ESRI Shapefile'
## Simple feature collection with 12 features and 3 fields
## geometry type:  LINESTRING
## dimension:      XY
## bbox:           xmin: -99.21479 ymin: 19.29063 xmax: -98.96099 ymax: 19.53445
## CRS:            4326
metro_stations <- st_read("../data/estaciones-metro/estaciones-metro.shp") %>% st_transform(4485) %>% st_buffer(dist = units::as_units(250, "m"))
## Reading layer `estaciones-metro' from data source `/Users/rebecadebuen/Dropbox/Projects/carpetas_investigacion/data/estaciones-metro/estaciones-metro.shp' using driver `ESRI Shapefile'
## Simple feature collection with 195 features and 8 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -99.21464 ymin: 19.2868 xmax: -98.96062 ymax: 19.53444
## CRS:            4326
ggplot(metro_stations) + geom_sf()

First, I plot the points where crimes are reported over the metro lines

metro_crimes <- transport_crimes_geolocated %>% filter(transp_mode == "Metro")


cols <- c("Line 1" = "hotpink1", "Line 2" = "blue", "Line 3" = "olivedrab", 
          "Line 4" = "cadetblue1", "Line 5" = "yellow", "Line 6" = "red",
          "Line 7" = "orange", "Line 8" = "darkgreen", "Line 9" = "brown", 
          "Line A" = "purple", "Line B" = "grey69", "Line 12" = "darkgoldenrod")

violence.labs <- c(Yes = "Violent", No = "Non-violent")


ggplot(colonias) + 
  geom_sf(size = 0.1) + 
  geom_sf(data = metro_lines, aes(color = line), size = 1) + scale_colour_manual(values = cols) + 
  geom_sf(data=metro_crimes,  color = "black", size = 0.4, alpha = 0.5) +
    facet_wrap(~violence, labeller = labeller(violence = violence.labs)) +
    theme_void() + theme(legend.title = element_blank()) +
  theme(legend.position= "bottom", legend.direction="horizontal")

colonias_crimes <- colonias %>% st_join(metro_crimes) %>% filter()

colonias_counts <- colonias_crimes %>% 
  filter(!is.na(name)) %>% 
  st_drop_geometry() %>% 
  group_by(name) %>% 
  summarize(n = n()) %>% 
  full_join(colonias %>% 
              filter(!is.na(name)) %>%
              group_by(name) %>% 
              slice(1L) %>% 
              select(name, geometry)) %>%
  mutate(n = ifelse(is.na(n), 0, n)) %>%
  st_as_sf()
## Joining, by = "name"
colonias_counts %>% ggplot(aes(fill = n)) + 
  geom_sf(size = 0.1) + 
  scale_fill_gradient(low = "white", high = "black", 
                      na.value = NA, trans = "log", breaks = c(1, 7, 50, 400)) + theme_void() +
    geom_sf(data = metro_lines, aes(color = line), size = 0.5, inherit.aes = FALSE) +
  scale_colour_manual(values = cols) +
     theme(legend.title = element_blank()) +
  theme(legend.position= "bottom", legend.direction="horizontal")

station_crimes <- metro_crimes %>% st_join(metro_stations)
station_counts <- station_crimes %>% 
  st_drop_geometry() %>% 
  group_by(violence, stop_name) %>% 
  summarize(n = n()) %>% arrange(desc(n, by_group = TRUE)) %>%
  mutate(stop_name = str_remove_all(stop_name, "[_1234567890]")) %>% 
  ungroup()

top_violent_stations <- station_counts %>% select(stop_name, violence, n) %>% 
  filter(!is.na(stop_name), violence == "Yes") %>% 
  group_by(stop_name) %>% slice(1)  %>% 
  arrange(desc(n, by_group = TRUE)) %>% 
  ungroup() %>% 
  select(-violence) %>% 
  top_n(10)

top_nonviolent_stations <- station_counts %>% select(stop_name, violence, n) %>% 
  filter(!is.na(stop_name), violence == "No") %>% 
  group_by(stop_name) %>% 
  slice(1)  %>% 
  arrange(desc(n, by_group = TRUE)) %>% 
  select(-violence) %>% 
  top_n(10)

For other cars, what neighborhoods are the most dangerous?

car_crimes <- transport_crimes_geolocated %>% filter(transp_mode == "Car")

ggplot(colonias) + 
  geom_sf(size = 0.1) + 
  geom_sf(data=car_crimes, color = "red", size = 0.1, alpha = 0.3) +
  #geom_sf(data = metrobus_stations, color = "blue", size = 0.5) +
  theme_void()

colonias_crimes <- colonias %>% st_join(car_crimes)

colonias_counts <- colonias_crimes %>% 
  filter(!is.na(name)) %>% 
  st_drop_geometry() %>% 
  group_by(name) %>% 
  summarize(n = n()) %>% 
  full_join(colonias %>% 
              filter(!is.na(name)) %>%
              group_by(name) %>% 
              slice(1L) %>% 
              select(name, geometry)) %>%
  mutate(n = ifelse(is.na(n), 0, n)) %>%
  st_as_sf()
## Joining, by = "name"
colonias_counts %>% ggplot(aes(fill = n)) + 
  geom_sf(size = 0.1) + 
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA, trans = "log", breaks = c(1, 7, 50)) + 
  theme_void() 

colonias_crimes <- colonias %>% st_join(transport_crimes_geolocated)

colonias_counts <- colonias_crimes %>% 
  filter(!is.na(name)) %>% 
  st_drop_geometry() %>% 
  group_by(name) %>% 
  summarize(n = n()) %>% 
  full_join(colonias %>% 
              filter(!is.na(name)) %>%
              group_by(name) %>% 
              slice(1L) %>% 
              select(name, geometry)) %>%
  mutate(n = ifelse(is.na(n), 0, n)) %>%
  st_as_sf()

colonias_counts %>% ggplot(aes(fill = n)) + 
  geom_sf(size = 0.1) + 
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA, trans = "log", breaks = c(1, 7, 50, 400)) + 
  theme_void()